Abstract

The constraints arising from DAG mod- 
els with latent variables can be naturally 
represented by means of acyclic directed 
mixed graphs (ADMGs). Such graphs 
contain directed (→) and bidirected (↔)
arrows, and contain no directed cycles. 
DAGs with latent variables imply in- 
dependence constraints in the distribu- 
tion resulting from a 'fixing' operation, 
in which a joint distribution is divided 
by a conditional. This operation gen- 
eralizes marginalizing and conditioning. 
Some of these constraints correspond to 
identifiable 'dormant' independence con- 
straints, with the well known 'Verma 
constraint' as one example. Recently, 
models defined by a set of the constraints 
arising after fixing from a DAG with la-
tents were characterized via a recursive
factorization and a nested Markov prop- 
erty. In addition, a parameterization 
was given in the discrete case. In this 
paper we use this parameterization to 
describe a parameter fitting algorithm, 
and a search and score structure learn- 
ing algorithm for these nested Markov 
models. We apply our algorithms to a 
variety of datasets. 

1 Introduction 

Many data-generating processes correspond to dis-
tributions that factorize according to a directed
acyclic graph (DAG). Such models also have an 



intuitive causal interpretation: an arrow from a 
variable X to a variable Y in a DAG model can
be interpreted, in a way that can be made precise, 
to mean that X is a "direct cause" of Y. 

In many contexts we do not observe all of the vari- 
ables in the data-generating process. This cre- 
ates major challenges for structure learning and 
for identifying causal intervention distributions. 
While existing machinery based on DAGs with la- 
tent variables can be applied to such settings, this 
creates a number of problems. First, there will in 
general be an infinite number of DAG models such 
that a particular margin of these models may rep- 
resent the observed distribution; this is still true 
if we require the graph to be faithful. Second, 
prior knowledge about latent variables is often 
scarce, which implies any modeling assumptions 
made by explicitly representing latents leave one
open to model misspecification bias. An alter-
native approach is to consider graphical mod-
els represented by graphs containing directed and
bidirected edges, called Acyclic Directed Mixed 
Graphs (ADMGs). In a companion paper we de- 
fine a 'nested' Markov property for ADMGs that 
encodes independence constraints under a 'fixing' 
operation that divides the joint distribution by a 
conditional density. Given a DAG G with latent
variables there is an ADMG G* naturally associ-
ated with G via the operation of 'latent projec-
tion' [19]; the vertices of G* are solely the subset
of vertices in G that are observed. We show that
the observed distribution resulting from the DAG
with latent variables G obeys the nested Markov
property associated with the corresponding latent
projection G*.




Figure 1: (a) A latent variable DAG not entailing
any d-separation statements on $X_1, X_2, X_3, X_4$. (b)
An ADMG $\mathcal{G}$ with missing edges representing a
saturated nested Markov model.

Previous work [3] has given a discrete parameterization
and ML fitting algorithms, as well as a characterization
of model equivalence for mixed graph models representing
conditional independences [1]. It is well-known, however,
that models representing DAG marginals can contain
non-parametric constraints which cannot be represented as
conditional independence constraints. For instance, in any
density P Markov relative to a DAG with latents represented
by the graph shown in Fig. 1(a), it is known (see [19], [13]) that

$$\frac{\partial}{\partial x_1} \sum_{x_2} P(x_4 \mid x_1, x_2, x_3)\, P(x_2 \mid x_1) = 0. \tag{1}$$

This constraint can be viewed as stating that $X_4$
is independent of $X_1$ in the distribution obtained
from $P(x_1, x_2, x_3, x_4)$ after dividing by a conditional
$P(x_3 \mid x_2, x_1)$ [12]. Also note that the expression (1)
is an instance of the g-formula of [13]. If the graph shown
in Fig. 1(a) is causal, then this constraint can be
interpreted as an (identifiable) dormant independence
constraint [15], which states that $X_4$ is independent of
$X_1$ given $do(x_3)$, where $do(\cdot)$ denotes an intervention [7].

Since the DAG in Fig. 1(a) implies no conditional
independences on the 4 observable variables, the
set of marginal distributions on $x_1, x_2, x_3, x_4$ obtained
from densities over $x_1, x_2, x_3, x_4, h_1, h_2$
Markov relative to this DAG is a saturated model
when viewed as a model of conditional independence.
Thus, any structure learning algorithm
which only relies on conditional independence
constraints will return a (maximally uninformative)
unoriented complete graph when given data
sampled from one of these marginal distributions.

Nevertheless, it is possible to use constraints such 



as (1), which we call post-truncation indepen- 
dences or reweighted independences to distinguish 
between models, with appropriate assumptions. 
In [16], such constraints were used to test for 
the presence of certain direct causal effects (rep- 
resented by directed arrows in graphical causal 
models). A recent paper [11] has given a nested 
factorization for mixed graph models which im- 
plies, along with the standard conditional inde- 
pendences in mixed graphs, post-truncation inde- 
pendences of the type shown in (1). Furthermore, 
a parameterization for discrete models based on 
this factorization was given. Another recent paper [17]
has taken advantage of this parameterization to give a
general algorithm for efficiently computing causal effects.
In this paper, we take advantage of this parameterization
to give a maximum likelihood parameter fitting algorithm for
mixed graph models of post-truncation independence.
Furthermore, we use this algorithm to construct a search
and score algorithm based on BIC [14] for learning mixed
graph structure while taking advantage of post-truncation
independences.

The paper is organized as follows. In section 2, we
introduce the graphical and probabilistic prelim-
inaries necessary for the remainder of the paper.
In section 3, we introduce nested Markov models. 
In section 4, we give a parameterization of discrete 
nested Markov models. In section 5, we describe 
the parameter fitting algorithm. In section 6, we 
describe the search and score algorithm. Section 
7 contains our experiments. Section 8 gives our 
conjecture for the characterization of equivalence 
classes of mixed graphs over four nodes, and gives 
experimental evidence in favor of our conjecture. 
Section 9 contains the discussion and concluding 
remarks. 



2 Preliminaries 

A directed mixed graph $\mathcal{G}(V, E)$ is a graph with a
set of vertices $V$ and a set of edges $E$ which may
contain directed ($\to$) and bidirected ($\leftrightarrow$) edges. A
directed cycle is a path of the form $x \to \cdots \to y$
along with an edge $y \to x$. An acyclic directed
mixed graph (ADMG) is a mixed graph containing
no directed cycles.



2.1 Conditional ADMGs 



A conditional acyclic directed mixed graph
(CADMG) $\mathcal{G}(V, W, E)$ is an ADMG with a vertex
set $V \cup W$, where $V \cap W = \emptyset$, subject to the
restriction that for all $w \in W$,
$\mathrm{pa}_{\mathcal{G}}(w) = \emptyset = \mathrm{sp}_{\mathcal{G}}(w)$.

Whereas an ADMG with vertex set $V$ represents
a joint density $p(x_V)$, a conditional ADMG is a
graph with two disjoint sets of vertices, $V$ and $W$,
that is used to represent the Markov structure of a
'kernel' $q_V(x_V \mid x_W)$. Following [6, p. 46], we define
a kernel to be a non-negative function $q_V(x_V \mid x_W)$
satisfying:

$$\sum_{x_V \in \mathfrak{X}_V} q_V(x_V \mid x_W) = 1 \quad \text{for all } x_W \in \mathfrak{X}_W. \tag{2}$$

We use the term 'kernel' and write $q_V(\cdot \mid \cdot)$ (rather
than $p(\cdot \mid \cdot)$) to emphasize that these functions,
though they satisfy (2) and thus most properties
of conditional densities, will not, in general,
be formed via the usual operation of conditioning
on the event $X_W = x_W$. To conform with
standard notation for densities, we define for every
$A \subseteq V$,

$$q_V(x_A \mid x_W) \equiv \sum_{x_{V \setminus A}} q_V(x_V \mid x_W), \qquad q_V(x_{V \setminus A} \mid x_{W \cup A}) \equiv \frac{q_V(x_V \mid x_W)}{q_V(x_A \mid x_W)}.$$

For a CADMG $\mathcal{G}(V, W, E)$ we consider collections
of random variables $(X_v)_{v \in V}$ taking values
in probability spaces $(\mathfrak{X}_v)_{v \in V}$ conditional on
variables $(X_w)_{w \in W}$ with state spaces $(\mathfrak{X}_w)_{w \in W}$. In
all the cases we consider the probability spaces
are either real finite-dimensional vector spaces
or finite discrete sets. For $A \subseteq V \cup W$ we let
$\mathfrak{X}_A \equiv \times_{u \in A} (\mathfrak{X}_u)$, and $X_A \equiv (X_v)_{v \in A}$. We use
the usual shorthand notation: $v$ denotes a vertex
and a random variable $X_v$, likewise $A$ denotes a
vertex set and $X_A$. It is because we will always
condition on the variables in $W$ that we do not
permit edges between vertices in $W$.

An ADMG $\mathcal{G}(V, E)$ may be seen as a CADMG
in which $W = \emptyset$. In this manner, though we will
state subsequent definitions for CADMGs, they
will also apply to ADMGs.

The induced subgraph of a CADMG $\mathcal{G}(V, W, E)$
given by set $A$, denoted $\mathcal{G}_A$, consists of
$\mathcal{G}(V \cap A, W \cap A, E_A)$, where $E_A$ is the set of edges in
$\mathcal{G}$ with both endpoints in $A$. Note that in forming
$\mathcal{G}_A$, the status of the vertices in $A$ with regard
to whether they are in $V$ or $W$ is preserved.


2.2 Districts 

A set $C$ is connected in $\mathcal{G}$ if every pair of vertices
in $C$ are connected by a path with every vertex
on the path contained in $C$. A connected set $C$ in
an ADMG $\mathcal{G}$ is inclusion maximal if no superset
of $C$ is connected.

For a given CADMG $\mathcal{G}(V, W, E)$, the induced
bidirected graph $(\mathcal{G})_{\leftrightarrow}$ is the CADMG formed by
removing all directed edges from $\mathcal{G}$. Similarly, $(\mathcal{G})_{\to}$
is formed by removing all bidirected edges. A set
connected in $(\mathcal{G})_{\leftrightarrow}$ is called bidirected connected.

For a given vertex $x \in V$ in $\mathcal{G}$, the district
(c-component) of $x$, denoted by $\mathrm{dis}_{\mathcal{G}}(x)$, is the
connected component of $(\mathcal{G})_{\leftrightarrow}$ containing $x$.
Districts in an ADMG $\mathcal{G}(V, E)$ form a partition of $V$.
In a DAG $\mathcal{G}(V, E)$ the set of districts is the set of all
single element node sets of $V$. In a CADMG, all districts
are subsets of $V$; the nodes of $W$ are not
included by definition. For an induced subgraph
$\mathcal{G}_A$, we write $\mathrm{dis}_A(x)$ as a shorthand for $\mathrm{dis}_{\mathcal{G}_A}(x)$.
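As an illustration of the definition of districts, the following is a minimal Python sketch that partitions the random vertices into bidirected-connected components. The function name `districts`, the edge-list representation, and the four-vertex example graph are assumptions made for this illustration only.

```python
from collections import defaultdict


def districts(random_vertices, bidirected_edges):
    """Partition the random vertices V into districts, i.e. the
    connected components of the graph that keeps only bidirected edges."""
    vset = set(random_vertices)
    adjacency = defaultdict(set)
    for a, b in bidirected_edges:
        # Fixed vertices (those in W) never belong to a district.
        if a in vset and b in vset:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, result = set(), []
    for v in random_vertices:
        if v in seen:
            continue
        component, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adjacency[u] - component)
        seen |= component
        result.append(frozenset(component))
    return result


# Hypothetical example: four vertices with a single bidirected edge x2 <-> x4.
example_districts = districts(["x1", "x2", "x3", "x4"], [("x2", "x4")])
```

In a graph without bidirected edges the same function returns only singletons, matching the remark above about DAGs.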

2.3 The fixing operation and fixable 
vertices 

We now introduce a 'fixing' operation on an 
ADMG or CADMG that has the effect of trans- 
forming a random vertex into a fixed vertex, 
thereby changing the graph. However, in general 
this operation may only be applied to a subset of 
the vertices in the graph, which we term the set 
of (potentially) fixable vertices. 

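Definition 1 Given a CADMG $\mathcal{G}(V, W)$ the set
of fixable vertices,

$$\mathbb{F}(\mathcal{G}) \equiv \{ v \mid v \in V,\ \mathrm{dis}_{\mathcal{G}}(v) \cap \mathrm{de}_{\mathcal{G}}(v) = \{v\} \}.$$

In words, a vertex $v$ is fixable in $\mathcal{G}$ if there is no
other vertex $v^*$ that is both a descendant of $v$ and in
the same district as $v$ in $\mathcal{G}$.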
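The fixability condition can be checked mechanically. The sketch below is one possible Python rendering of Definition 1, under the convention (implied by the formula) that a vertex is its own descendant. The helper names and the example graph $x_1 \to x_2 \to x_3 \to x_4$ with $x_2 \leftrightarrow x_4$ are assumptions chosen for illustration.

```python
def descendants(v, directed_edges):
    """Vertices reachable from v along directed edges, including v itself."""
    children = {}
    for a, b in directed_edges:
        children.setdefault(a, set()).add(b)
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for c in children.get(u, ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def district_of(v, random_vertices, bidirected_edges):
    """Bidirected-connected component of v among the random vertices."""
    vset = set(random_vertices)
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in bidirected_edges:
            for s, t in ((a, b), (b, a)):
                if s == u and t in vset and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen


def fixable(random_vertices, directed_edges, bidirected_edges):
    """Set of v in V whose district and descendant set share only v."""
    return {
        v
        for v in random_vertices
        if (district_of(v, random_vertices, bidirected_edges)
            & descendants(v, directed_edges)) == {v}
    }


# Hypothetical example graph: x1 -> x2 -> x3 -> x4 and x2 <-> x4.
V = ["x1", "x2", "x3", "x4"]
D = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]
B = [("x2", "x4")]
example_fixable = fixable(V, D, B)
```

Here `x2` is not fixable because `x4` is both a descendant of `x2` and in its district.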

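Definition 2 Given a CADMG $\mathcal{G}(V, W, E)$, and
a kernel $q_V(X_V \mid X_W)$, for every $r \in \mathbb{F}(\mathcal{G})$ we
associate a fixing transformation $\phi_r$ on the pair
$(\mathcal{G}, q_V(X_V \mid X_W))$ defined as follows:

$$\phi_r(\mathcal{G}) \equiv \mathcal{G}^*(V \setminus \{r\}, W \cup \{r\}, E_r),$$

where $E_r$ is the subset of edges in $E$ that do not
have arrowheads into $r$, and

$$\phi_r(q_V(x_V \mid x_W); \mathcal{G}) \equiv \frac{q_V(x_V \mid x_W)}{q_V(x_r \mid x_{\mathrm{mb}_{\mathcal{G}}(r,\, \mathrm{an}_{\mathcal{G}}(\mathrm{dis}_{\mathcal{G}}(r)))})}. \tag{3}$$

We use $\circ$ to indicate composition of operations
in the natural way, so that:
$\phi_r \circ \phi_s(\mathcal{G}) \equiv \phi_r(\phi_s(\mathcal{G}))$ and
$\phi_r \circ \phi_s(q_V(X_V \mid X_W); \mathcal{G}) \equiv \phi_r(\phi_s(q_V(X_V \mid X_W); \mathcal{G}); \phi_s(\mathcal{G}))$.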
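The graph part of the fixing transformation can be sketched in a few lines of Python. The tuple representation `(V, W, directed, bidirected)` and the function name `fix` are assumptions for this illustration. The sketch does not check fixability and does not implement the kernel division in (3).

```python
def fix(graph, r):
    """Graph part of the fixing transformation phi_r.

    graph is a tuple (random vertices V, fixed vertices W,
    directed edges as (tail, head) pairs, bidirected edges as pairs).
    Fixability of r is assumed, not checked.
    """
    random_v, fixed_v, directed, bidirected = graph
    if r not in random_v:
        raise ValueError("only random vertices can be fixed")
    # Drop every edge with an arrowhead into r: directed edges pointing
    # at r, and all bidirected edges touching r.
    kept_directed = {(a, b) for (a, b) in directed if b != r}
    kept_bidirected = {(a, b) for (a, b) in bidirected if r not in (a, b)}
    return (random_v - {r}, fixed_v | {r}, kept_directed, kept_bidirected)


# Hypothetical example graph: x1 -> x2 -> x3 -> x4 and x2 <-> x4.
example_graph = (
    frozenset({"x1", "x2", "x3", "x4"}),
    frozenset(),
    {("x1", "x2"), ("x2", "x3"), ("x3", "x4")},
    {("x2", "x4")},
)
after_x3 = fix(example_graph, "x3")
```

Fixing `x3` moves it from $V$ to $W$ and removes the edge `x2 -> x3`, while the outgoing edge `x3 -> x4` is kept.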

2.4 Reachable and Intrinsic Sets 

In order to define our factorization, we will need 
to define special classes of vertex sets in ADMGs. 

Definition 3 A CADMG $\mathcal{G}(V, W)$ is reachable
from an ADMG $\mathcal{G}^*(V \cup W)$ if there is an ordering
of the vertices in $W = (w_1, \ldots, w_k)$, such that
$w_1 \in \mathbb{F}(\mathcal{G}^*)$ and for $j = 2, \ldots, k$,

$$w_j \in \mathbb{F}(\phi_{w_{j-1}} \circ \cdots \circ \phi_{w_1}(\mathcal{G}^*)).$$

In words, a subgraph is reachable if, under some
ordering, each of the vertices in $W$ may be
fixed, first in $\mathcal{G}^*$, and then in $\phi_{w_1}(\mathcal{G}^*)$, then in
$\phi_{w_2}(\phi_{w_1}(\mathcal{G}^*))$, and so on. If a CADMG $\mathcal{G}(V, W)$
is reachable from $\mathcal{G}^*(V \cup W)$, we say that the set
$V$ is reachable in $\mathcal{G}^*$. Note that a reachable set
$R$ in $\mathcal{G}$ may be obtained by fixing vertices using
more than one valid sequence. We will denote any
valid composition of fixing operations that fixes a
set $A$ by $\phi_A$ if applied to the graph, and by $\phi_{X_A}$ if
applied to a kernel. Note that with a slight abuse
of notation (though justified as we will later see)
we suppress the precise fixing sequence chosen.

Definition 4 A set $C$ is intrinsic in $\mathcal{G}$ if it is a
district in a reachable subgraph of $\mathcal{G}$. The set of
intrinsic sets in an ADMG $\mathcal{G}$ is denoted by $\mathcal{I}(\mathcal{G})$.

Note that in any DAG $\mathcal{G}(V, E)$,
$\mathcal{I}(\mathcal{G}) = \{\{x\} \mid x \in V\}$, while in any bidirected
graph $\mathcal{G}$, $\mathcal{I}(\mathcal{G})$ is equal
to the set of all connected sets in $\mathcal{G}$.
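Definitions 3 and 4 suggest a direct, if exponential, way to enumerate intrinsic sets in small graphs: recursively fix every fixable vertex and collect the districts of each reachable subgraph. The Python sketch below does this. The function names are hypothetical, and the memoisation relies on the observation that the graph obtained by fixing depends only on which vertices have been fixed.

```python
def _components(nodes, pairs):
    """Connected components of `nodes` under the undirected relation `pairs`."""
    nodes, seen, comps = set(nodes), set(), []
    for v in sorted(nodes):
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for a, b in pairs:
                for s, t in ((a, b), (b, a)):
                    if s == u and t in nodes and t not in comp:
                        comp.add(t)
                        stack.append(t)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def _descendants(v, directed):
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in directed:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def intrinsic_sets(vertices, directed, bidirected):
    """Districts of every subgraph reachable by a valid fixing sequence."""
    found, visited = set(), set()

    def visit(V, D, B):
        if V in visited:
            return
        visited.add(V)
        comps = _components(V, B)
        found.update(comps)
        for v in V:
            dist = next(c for c in comps if v in c)
            if dist & _descendants(v, D) == {v}:  # v is fixable
                visit(V - {v},
                      frozenset(e for e in D if e[1] != v),
                      frozenset(e for e in B if v not in e))

    visit(frozenset(vertices), frozenset(directed), frozenset(bidirected))
    return found


# A DAG a -> b -> c and a bidirected chain a <-> b <-> c.
dag_sets = intrinsic_sets("abc", [("a", "b"), ("b", "c")], [])
chain_sets = intrinsic_sets("abc", [], [("a", "b"), ("b", "c")])
```

The two examples reproduce the remark above: singletons for the DAG, and all bidirected-connected sets for the bidirected chain.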

3 Nested Markov Models 

We define a factorization on probability distribu- 
tions represented by ADMGs via intrinsic sets. 



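Definition 5 (nested factorization) Let
$\mathcal{G}(V, E)$ be an ADMG. A distribution $p(x_V)$ obeys
the nested factorization according to $\mathcal{G}(V, E)$ if for
every reachable subset $A \subseteq V$,

$$\phi_{X_{V \setminus A}}(p(x_V); \mathcal{G}) = \prod_{D \in \mathcal{D}(\phi_{V \setminus A}(\mathcal{G}))} f_D(x_D \mid x_{\mathrm{pa}_{\mathcal{G}}(D) \setminus D}),$$

where $\mathcal{D}(\cdot)$ denotes the set of districts of a CADMG.

A distribution $p(x_V)$ that obeys the nested
factorization with respect to $\mathcal{G}$ is said to be in the
nested Markov model of $\mathcal{G}$.

Theorem 6 If $p(x_V)$ is in the nested Markov
model of $\mathcal{G}$, then for any reachable set $A$ in
$\mathcal{G}$, any valid fixing sequence on $V \setminus A$ gives
the same CADMG over $A$, and the same kernel
$q_A(x_A \mid x_{V \setminus A})$ obtained from $p(x_V)$.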

Due to this theorem, our decision to suppress the 
precise fixing sequence from the fixing operation 
is justified. 

It is known that nested Markov factorization im- 
plies the global Markov property for ADMGs. 

Theorem 7 If a distribution $p(x_V)$ is in the
nested Markov model for $\mathcal{G}(V, E)$ then $p(x_V)$
obeys the global Markov property for $\mathcal{G}(V, E)$.

The proof appears in [11]. This result implies 
nested Markov models capture all conditional in- 
dependence statements normally associated with 
mixed graphs. In addition, nested Markov models 
capture additional independence constraints that 
manifest after truncation operations. For example,
all distributions contained in the model that
factorizes according to the graph shown in Fig. 1(a)
obey the constraint that $X_1$ is independent
of $X_4$ after "truncating out" (that is, dividing by)
$p(x_3 \mid x_2, x_1)$.

4 Parameterization of Binary Nested 
Markov Models 

We now give a parameterization of nested Markov 
models. The approach generalizes in a straight- 
forward way to finite discrete state spaces. 

4.1 Heads and Tails of Intrinsic Sets 

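Definition 8 For an intrinsic set $C \in \mathcal{I}(\mathcal{G})$ of
a CADMG $\mathcal{G}$, define the recursive head (rh) as:
$\mathrm{rh}(C) \equiv \{x \mid x \in C;\ \mathrm{ch}_{\mathcal{G}_C}(x) = \emptyset\}$.

Definition 9 The tail associated with a recursive
head $H$ of an intrinsic set $C$ in a CADMG $\mathcal{G}$ is
given by: $\mathrm{tail}(H) \equiv (C \setminus H) \cup \mathrm{pa}_{\mathcal{G}}(C)$.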
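Definitions 8 and 9 translate into two short set computations. The Python sketch below reads the recursive head as the set of vertices of $C$ with no children inside $C$. The function names and the example graph $x_1 \to x_2 \to x_3 \to x_4$ are assumptions for illustration.

```python
def recursive_head(C, directed_edges):
    """Vertices of C that have no children inside C."""
    C = set(C)
    return {x for x in C
            if not any(a == x and b in C for a, b in directed_edges)}


def tail(C, directed_edges):
    """(C minus its recursive head) together with the parents of C."""
    C = set(C)
    head = recursive_head(C, directed_edges)
    parents = {a for a, b in directed_edges if b in C}
    return (C - head) | parents


# Hypothetical example: directed edges x1 -> x2 -> x3 -> x4,
# with {x2, x4} and {x4} taken as intrinsic sets.
D = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]
head_24, tail_24 = recursive_head({"x2", "x4"}, D), tail({"x2", "x4"}, D)
head_4, tail_4 = recursive_head({"x4"}, D), tail({"x4"}, D)
```

A head vertex has no children in $C$, so it can never be a parent of $C$, and the head and tail are always disjoint.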

4.2 Binary Parameterization 

Multivariate binary distributions which obey the
nested factorization with respect to a CADMG
$\mathcal{G}$ may be parameterized by the following parameters:

Definition 10 The binary parameters associated
with a CADMG $\mathcal{G}$ are a set of functions:
$Q_{\mathcal{G}} \equiv \{ q_C(X_H = 0 \mid x_{\mathrm{tail}(H)}) \mid H = \mathrm{rh}(C),\ C \in \mathcal{I}(\mathcal{G}) \}$.

Intuitively, a parameter $q_C(X_H = 0 \mid x_{\mathrm{tail}(H)})$ is
the probability that the variable set $X_H$ assumes
values 0 in a kernel obtained from $p(x_V)$ by fixing
$X_{V \setminus C}$, and conditioning on $X_{\mathrm{tail}(H)}$. As a
shorthand, we will denote the parameter
$q_C(X_H = 0 \mid x_{\mathrm{tail}(H)})$ as $\theta_H(x_{\mathrm{tail}(H)})$.

Definition 11 Let $\nu : V \cup W \to \{0, 1\}$ be an
assignment of values to the variables indexed by
$V \cup W$. Define $\nu(T)$ to be the values assigned to
variables indexed by a subset $T \subseteq V \cup W$. Let
$\nu^{-1}(0) = \{v \mid v \in V, \nu(v) = 0\}$.

A distribution $P(X_V \mid X_W)$ is said to be parameterized
by the set $Q_{\mathcal{G}}$, for CADMG $\mathcal{G}$ if:

$$p(X_V = \nu(V) \mid X_W = \nu(W)) = \sum_{B \,:\, \nu^{-1}(0) \cap V \subseteq B \subseteq V} (-1)^{|B \setminus \nu^{-1}(0)|} \times \prod_{H \in [B]_{\mathcal{G}}} \theta_H(X_{\mathrm{tail}(H)} = \nu(\mathrm{tail}(H)))$$

where the empty product is defined to be 1, and
$[B]_{\mathcal{G}}$ is a partition of nodes in $B$ given in [17].

Note that this parameterization maps $\theta_H$ parameters
to probabilities in a CADMG via an inverse
Möbius transform. Note also that this parameterization
generalizes both the standard Markov
parameterization of DAGs in terms of parameters
of the form $p(x_i = 0 \mid \mathrm{pa}(x_i))$, and the
parameterization of bidirected graph models given in [3].

4.3 Example 

Consider an ADMG $\mathcal{G}$ shown in Fig. 1(b). The
parameters associated with a binary model represented
by this graph are:

$$\theta_{1,4}(x_2, x_3),\ \theta_1(x_2, x_3),\ \theta_{2,3},\ \theta_2,\ \theta_3,\ \theta_4(x_2),\ \theta_{3,4}(x_2).$$



Each of these parameters is a function which maps
binary values to probabilities. A parameter with $k$
tail variables contributes $2^k$ values, so this
binary model contains 4 + 4 + 1 + 1 + 1 + 2 + 2 = 15
parameters; in other words it is saturated. This is the
case even though $\mathcal{G}$ is not a complete graph. A similar
situation arises in mixed graph models of conditional
independence. In such models a model represented
by a graph with missing edges may be saturated
if nodes which are not direct neighbors are connected
by an inducing path [18]. In particular, the
mixed graph shown in Fig. 1 represents a saturated
model of conditional independence because
there is an inducing path between $X_1$ and $X_4$.
The reason this graph does not represent a saturated
nested Markov model is that truncations
allow us to test independence of some pairs of
non-adjacent nodes, even if they are connected by an
inducing path. However, there are some inducing
paths which are "dense" enough such that truncations
cannot be used to test independence between
node pairs connected by such a path. Such
a dense inducing path exists in the graph in Fig.
1(b) between $X_1$ and $X_4$.

As an illustration of our parameterization, for the
graph in Fig. 1(b), we have the following:

$$p(x_1 = 0, x_3 = 0, x_4 = 0, x_2 = 1) = \theta_{1,4}(x_2 = 1, x_3 = 0)\,\theta_3 - \theta_{1,4}(x_2 = 1, x_3 = 0)\,\theta_{2,3}$$

$$p(x_1 = 0, x_3 = 0, x_2 = 1, x_4 = 1) = \theta_1(x_2 = 1, x_3 = 0)\,\theta_3 - \theta_1(x_2 = 1, x_3 = 0)\,\theta_{2,3} - \theta_{1,4}(x_2 = 1, x_3 = 0)\,\theta_3 + \theta_{1,4}(x_2 = 1, x_3 = 0)\,\theta_{2,3}$$
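The two expansions above can be reproduced numerically with a short Python sketch of the alternating sum in definition 11. The parameter values in `THETA` are invented for illustration. The partitions in `PARTITION` are simply read off the products displayed above rather than computed from the graph, since the general construction of the partition is not given in this paper.

```python
from itertools import combinations


def cell_probability(zeros, vertices, term):
    """Alternating sum over all B with zeros <= B <= vertices of
    (-1)^{|B minus zeros|} * term(B)."""
    zeros = frozenset(zeros)
    rest = sorted(set(vertices) - zeros)
    total = 0.0
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            total += (-1) ** k * term(zeros | frozenset(extra))
    return total


# Hypothetical parameter values, all evaluated at x2 = 1, x3 = 0 where a
# tail is present: theta_{1,4}, theta_1, theta_{2,3}, theta_3.
THETA = {"1,4": 0.2, "1": 0.45, "2,3": 0.3, "3": 0.55}

# Partition of each set B into heads, as in the displayed example.
PARTITION = {
    frozenset({1, 3}): ["1", "3"],
    frozenset({1, 2, 3}): ["1", "2,3"],
    frozenset({1, 3, 4}): ["1,4", "3"],
    frozenset({1, 2, 3, 4}): ["1,4", "2,3"],
}


def term(B):
    product = 1.0
    for head in PARTITION[B]:
        product *= THETA[head]
    return product


# p(x1=0, x3=0, x4=0, x2=1): zero set {1, 3, 4}.
p_first = cell_probability({1, 3, 4}, {1, 2, 3, 4}, term)
# p(x1=0, x3=0, x2=1, x4=1): zero set {1, 3}.
p_second = cell_probability({1, 3}, {1, 2, 3, 4}, term)
```

With these invented values the first cell evaluates to 0.2 · 0.55 − 0.2 · 0.3 = 0.05, and the second to 0.2475 − 0.135 − 0.11 + 0.06 = 0.0625.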



5 Parameter Fitting for Binary 
Nested Markov Models 

We now describe a parameter fitting algorithm
based on the parameterization in definition 11
which relates the parameters of a nested Markov
model and standard multinomial probabilities via
the Möbius inversion formula. We first describe
this mapping in more detail.

For a given CADMG $\mathcal{G}$, define $\mathcal{P}_{\mathcal{G}}$ to be the set of
all multinomial probability vectors in the simplex
$\Delta_{2^{|V|}-1}$ which obey the nested factorization
according to $\mathcal{G}$, and let $Q_{\mathcal{G}}$ be the set of all vectors of
parameters which define coherent nested Markov
models.



The mapping $\rho_{\mathcal{G}} : Q_{\mathcal{G}} \mapsto \mathcal{P}_{\mathcal{G}}$ in definition 11 can
be viewed as a composition $\mu_{\mathcal{G}} \circ \tau_{\mathcal{G}}$ of two mappings.
Here $\tau_{\mathcal{G}}$ maps $Q_{\mathcal{G}}$ to the set of all terms of
the form $\prod_{H \in [B]_{\mathcal{G}}} \theta_H(X_{\mathrm{tail}(H)} = \nu(\mathrm{tail}(H)))$
composed of parameters in $Q_{\mathcal{G}}$. We denote this set by
$T_{\mathcal{G}}$. The second mapping $\mu_{\mathcal{G}}$ maps $T_{\mathcal{G}}$ to $\mathcal{P}_{\mathcal{G}}$ via an
inverse Möbius transform. In [4] these mappings
were defined via element-wise matrix operations,
with $\tau_{\mathcal{G}}$ defined via a matrix $P$ containing 0 and 1
entries, and $\mu_{\mathcal{G}}$ defined via a matrix $M$ containing
$0$, $1$, $-1$ entries. There may be more efficient
representations of these mappings. In particular $\mu_{\mathcal{G}}$
may be evaluated via the fast Möbius transform
[5]. Such an efficient mapping was given in [17].
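To make the inverse Möbius step concrete, the following Python sketch evaluates the alternating sum over supersets with the standard in-place bitmask scheme and checks it against the direct double loop. It is a generic illustration rather than the implementation of [5] or [17]. The encoding (bit $i$ of a mask set when variable $i$ belongs to the subset) and the two-variable example are assumptions.

```python
def inverse_moebius_superset(t, n):
    """Return p with p[Z] = sum over B containing Z of
    (-1)^{|B minus Z|} * t[B], using n passes over the 2^n masks."""
    p = list(t)
    for i in range(n):
        bit = 1 << i
        for mask in range(1 << n):
            if not (mask & bit):
                p[mask] -= p[mask | bit]
    return p


def inverse_moebius_direct(t, n):
    """Same quantity by brute force, for checking small cases."""
    p = []
    for Z in range(1 << n):
        total = 0.0
        for B in range(1 << n):
            if (B & Z) == Z:
                total += (-1) ** bin(B ^ Z).count("1") * t[B]
        p.append(total)
    return p


# Two binary variables. t[B] is the probability that all variables in B
# equal 0; the underlying cell probabilities are 0.3, 0.1, 0.2, 0.4 for
# zero sets {}, {0}, {1}, {0, 1} respectively.
t_example = [1.0, 0.5, 0.6, 0.4]
p_fast = inverse_moebius_superset(t_example, 2)
p_slow = inverse_moebius_direct(t_example, 2)
```

The fast version performs $n \cdot 2^n$ subtractions instead of the $4^n$ terms of the direct sum.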

Note that $\rho_{\mathcal{G}}$ is smooth with respect to each parameter.
This implies we can solve many optimization
problems for functions expressed in
terms of $\rho_{\mathcal{G}}$ using standard iterative methods.
The difficulty is that the fitting algorithm must
be defined in such a way that each step that starts
in the parameter space $Q_{\mathcal{G}}$ stays in $Q_{\mathcal{G}}$. We use
the approach taken in prior work, where a single step of
the fitting algorithm updates the estimates for all
and only the parameters which refer to a particular
vertex $v \in V$. For a particular vector $q \in Q_{\mathcal{G}}$, let
$q(v)$ be the set of parameters whose heads contain
$v$. Let $Q_{\mathcal{G}}(v)$ be the subset of $Q_{\mathcal{G}}$ containing only
such vectors. Then the restriction of $\rho_{\mathcal{G}}$ to $Q_{\mathcal{G}}(v)$
is a linear function since any such parameter occurs
at most once in a term in $T_{\mathcal{G}}$. This implies
that $\rho_{\mathcal{G}}$ can be expressed as $A_v q(v) - b_v$
for some matrix $A_v$ and vector $b_v$. To remain
within $Q_{\mathcal{G}}$ it suffices to maintain the constraint
that $A_v q(v) > b_v$.

5.1 Maximum Likelihood Parameter 
Fitting 

We are now ready to describe our parameter fitting
algorithm. Our scheme closely follows that of prior work,
albeit with a different parameterization. The
algorithm iteratively updates parameters $q(v)$ for
every vertex $v$ in turn, and at each step maximizes
the log likelihood via gradient ascent. For
the purposes of this paper, we assume strictly positive
counts in our data. The case of zero counts
gives rise to certain statistical complications, and
will be handled in subsequent work.

Q-FIT($\mathcal{G}$, $q$, $L(\rho_{\mathcal{G}})$)

INPUT: $\mathcal{G}$ an ADMG, $q$ a set of parameters defining
a model which obeys the nested factorization
wrt $\mathcal{G}$, $L(\rho_{\mathcal{G}})$ a concave function defined in terms
of $\rho_{\mathcal{G}}$.

OUTPUT: $q$, a local maximum in the surface defined
on $Q_{\mathcal{G}}$ via $L(\rho_{\mathcal{G}})$.

Cycle through each vertex $v \in V$, and do

1 Construct the constraint matrices $A_v$, $b_v$.

2 Fit $q(v)$ to obtain new estimate $q^*$ maximizing
$L(\rho_{\mathcal{G}})$ subject to $A_v q(v) > b_v$.

3 If $q^*$ sufficiently close to $q$, return $q^*$.

4 Otherwise, set $q$ to $q^*$.

Figure 2: A parameter fitting algorithm for nested
Markov models.

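For a particular vertex $v$, the function we
are optimizing has the form
$\log \mathcal{L}_{\mathcal{G}}(q(v)) = \sum_i n_i \log \rho_{\mathcal{G}}(q(v))_i$,
where $n_i$ is the observed count in cell $i$, and
$\rho_{\mathcal{G}}$ is restricted to $Q_{\mathcal{G}}(v)$
and is thus a linear function in $q(v)$.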

Our fitting algorithm is given in Fig. 2. Our
choice of $L$ is the log likelihood function, which is
strictly concave in $q(v)$ by the above, while our initial
guess for $q$ is the set of parameters which define a fully
independent model. The optimization problem in
line 2 can be solved by standard gradient ascent
methods.
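The cyclic scheme of Fig. 2 can be illustrated on the smallest non-trivial case. The Python sketch below is not the authors' R implementation. It is a toy version for the assumed two-vertex graph $x_1 \leftrightarrow x_2$, whose parameters are $\theta_1$, $\theta_2$, $\theta_{1,2}$ and whose cell probabilities are linear in them. Each block update is gradient ascent with backtracking, and the positivity constraint is enforced by rejecting any step that leaves the feasible region. All names, step sizes and tolerances are assumptions.

```python
import math

# Cells ordered as (x1, x2) = (0,0), (0,1), (1,0), (1,1).
# p00 = t12, p01 = t1 - t12, p10 = t2 - t12, p11 = 1 - t1 - t2 + t12.
JAC = {
    "t1": [0.0, 1.0, 0.0, -1.0],
    "t2": [0.0, 0.0, 1.0, -1.0],
    "t12": [1.0, -1.0, -1.0, 1.0],
}
CONST = [0.0, 0.0, 0.0, 1.0]
# Parameters whose heads contain x1, respectively x2.
BLOCKS = [("t1", "t12"), ("t2", "t12")]


def cell_probs(theta):
    return [c + sum(JAC[k][i] * theta[k] for k in JAC)
            for i, c in enumerate(CONST)]


def log_lik(theta, freqs):
    p = cell_probs(theta)
    if min(p) <= 0.0:
        return -math.inf  # outside the coherent parameter space
    return sum(f * math.log(pi) for f, pi in zip(freqs, p))


def update_block(theta, freqs, block, max_steps=200):
    """Gradient ascent on one block with Armijo backtracking."""
    for _ in range(max_steps):
        p = cell_probs(theta)
        grad = {k: sum(f * JAC[k][i] / p[i] for i, f in enumerate(freqs))
                for k in block}
        sq_norm = sum(g * g for g in grad.values())
        base, step, moved = log_lik(theta, freqs), 1.0, False
        while step > 1e-12 and sq_norm > 0.0:
            cand = dict(theta)
            for k in block:
                cand[k] = theta[k] + step * grad[k]
            if log_lik(cand, freqs) >= base + 1e-4 * step * sq_norm:
                theta, moved = cand, True
                break
            step *= 0.5
        if not moved:
            break
    return theta


def q_fit(counts, tol=1e-12, max_sweeps=1000):
    total = float(sum(counts))
    freqs = [c / total for c in counts]
    theta = {"t1": 0.5, "t2": 0.5, "t12": 0.25}  # fully independent start
    prev = log_lik(theta, freqs)
    for _ in range(max_sweeps):
        for block in BLOCKS:
            theta = update_block(theta, freqs, block)
        cur = log_lik(theta, freqs)
        if cur - prev < tol:
            break
        prev = cur
    return theta


fitted = q_fit([40, 10, 20, 30])
fitted_probs = cell_probs(fitted)
```

Because this toy model is saturated, the fitted cell probabilities should approach the empirical frequencies, which gives a simple sanity check on the update.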

6 Structure Learning in Nested 
Markov Models 

A fitting algorithm which maximizes likelihood allows
us to do structure learning in nested Markov
models, using standard search and score methods
which use likelihood-based scoring criteria such as
BIC [14].

The algorithm is a standard greedy local search
augmented with a tabu meta-heuristic. We found
a meta-heuristic necessary for our search procedure
because a complete theory of equivalence of
models with respect to post-truncation independences
is not yet available. Because we do not
yet understand equivalence in this setting, we are
unable to define efficient local steps which always
move across equivalence classes as in the GES
algorithm for DAGs [2]. Without such steps, in
order to achieve reasonable local minima in the
score surface, the algorithm must be able to move
across score plateaus.

For the purposes of our experiments, we used
the BIC scoring function, although our approach
does not require this, and any competing scoring
method could have been used. We chose BIC due
to its desirable asymptotic properties.
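For concreteness, a BIC score can be computed from a maximized log likelihood, a parameter count and a sample size as in the Python sketch below. The sign convention (lower is better) and the example numbers are assumptions for illustration and are not results from our experiments.

```python
import math


def bic(log_likelihood, num_params, num_samples):
    """BIC in the 'lower is better' convention:
    -2 * log likelihood + (number of parameters) * log(sample size)."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)


# Hypothetical comparison on 5000 samples: a saturated four-variable binary
# model (15 parameters) against a model with one fewer parameter and a
# slightly lower maximized log likelihood.
bic_saturated = bic(-1000.0, 15, 5000)
bic_restricted = bic(-1001.5, 14, 5000)
prefer_restricted = bic_restricted < bic_saturated
```

In this made-up comparison the restricted model is preferred, because its penalty saving of log(5000) ≈ 8.5 outweighs the loss of 3 in −2 × log likelihood.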

We interpret the output of our search procedure
to be the "best" mixed graph model under the
assumption that every post-truncation independence
observed in the data has a structural explanation.
This assumption is a natural generalization
of the faithfulness [18], or stability
[7] assumption from the conditional independence
setting to the post-truncation independence setting.
We do not pursue the precise statement of
this assumption in this paper, since doing so entails
defining a strong global Markov property for
nested Markov models (the post-truncation analogue
of d-separation in DAGs and m-separation
in mixed graphs). This property is sufficiently
intricate that its definition and properties are
developed in a companion paper.



6.1 Implementation 

We implemented fitting and search algorithms using
the R language [8]. Our implementation was
based on an older implementation of fitting and
search for mixed graph models of conditional
independence [4].



7 Experiments 

To illustrate our search and score method, we used 
simulated data. 



7.1 Simulated Data from DAG Marginal
Models With Post-Truncation
Constraints

To demonstrate that our algorithm can successfully
learn "interesting" graphs distinct from
known DAG and MAG equivalence classes, we
have used search and score on simulated data obtained
from DAG models with latent variables
shown in Fig. 4. Both of these models are known
to contain post-truncation independences. The
observable nodes $X_1, X_2, X_3, X_4$ in the graph
shown in Fig. 4(a) correspond to binary random
variables, while the latent node $U$ corresponds to
a discrete random variable with 16 possible values.
Similarly, the observable nodes $X_1, X_2, X_3, X_4, X_5$
in the graph shown in Fig. 4(b) correspond
to binary random variables, while the latent
nodes $U_1, U_2$ correspond to discrete random
variables with 8 possible values.

The model shown in the graph in Fig. 4(a) contains
two independence constraints over observable
variables. The first is an ordinary conditional
independence constraint $(X_1 \perp\!\!\!\perp X_3 \mid X_2)$.
The second is a post-truncation independence
which states that $X_1 \perp\!\!\!\perp X_4$ after truncating out
$P(x_3 \mid x_2)$. The model shown in the graph
in Fig. 4(b) contains three independence constraints
over observable variables. The first two
are ordinary independence constraints which state
that $(X_3 \perp\!\!\!\perp X_1 \mid X_2)$ and $(X_4 \perp\!\!\!\perp X_1, X_2 \mid X_3)$.
The last is a post-truncation independence which
states that $X_4, X_5 \perp\!\!\!\perp X_1$ after truncating out
$P(x_3 \mid x_2)$. Note that $X_4$ and $X_5$ can both be
made conditionally independent of $X_1$, but not
by using the same conditioning set.
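A post-truncation independence of this kind can be checked numerically on a small latent variable model. The Python sketch below assumes the structure $X_1 \to X_2 \to X_3 \to X_4$ with a latent $U$ pointing into $X_2$ and $X_4$. This structure is consistent with the two constraints described for Fig. 4(a), but it is an assumption, since the figure itself is not reproduced here. The latent variable is given 3 states rather than 16 to keep the example small, and all probability tables are random.

```python
import itertools
import random

random.seed(0)
U_VALUES = range(3)  # assumption: 3 latent states instead of 16


def random_dist(k):
    weights = [random.random() + 0.1 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


# Random conditional probability tables for the assumed latent DAG.
p_u = random_dist(len(U_VALUES))
p_x1 = random_dist(2)
p_x2 = {(x1, u): random_dist(2) for x1 in (0, 1) for u in U_VALUES}
p_x3 = {x2: random_dist(2) for x2 in (0, 1)}
p_x4 = {(x3, u): random_dist(2) for x3 in (0, 1) for u in U_VALUES}


def joint(x1, x2, x3, x4):
    """Observed margin p(x1, x2, x3, x4), with U summed out."""
    return sum(p_u[u] * p_x1[x1] * p_x2[(x1, u)][x2]
               * p_x3[x2][x3] * p_x4[(x3, u)][x4] for u in U_VALUES)


def truncated(x1, x3, x4):
    """sum over x2 of P(x4 | x1, x2, x3) * P(x2 | x1)."""
    p_x1_marg = sum(joint(x1, a, b, c)
                    for a, b, c in itertools.product((0, 1), repeat=3))
    total = 0.0
    for x2 in (0, 1):
        p_x4_given = joint(x1, x2, x3, x4) / sum(
            joint(x1, x2, x3, c) for c in (0, 1))
        p_x2_given = sum(joint(x1, x2, b, c) for b, c in
                         itertools.product((0, 1), repeat=2)) / p_x1_marg
        total += p_x4_given * p_x2_given
    return total


# The truncated quantity should not depend on x1.
max_gap = max(abs(truncated(0, x3, x4) - truncated(1, x3, x4))
              for x3 in (0, 1) for x4 in (0, 1))
```

Under the assumed structure the sum reduces algebraically to $\sum_u p(u)\, p(x_4 \mid x_3, u)$, which does not involve $x_1$, so `max_gap` is zero up to floating point error.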

7.2 Results 

We chose parameters of the DAG models shown
in Fig. 4 in such a way as to ensure "approximately
faithful" models. We then generated samples
from our models and retained the values of
only $X_1, X_2, X_3, X_4$ in the first model, and of only
$X_1, X_2, X_3, X_4, X_5$ in the second model. We evaluated
the performance of our structure learning
algorithm on datasets ranging from 500 to 5000
samples (in 500 sample increments). For each
dataset size, we generated 1000 datasets randomly
from the true models.

Figure 3: (a) Probability of learning the true 4 node model vs sample size. (b) Probability of learning
the true 5 node model vs sample size.

Figures 3(a) and (b) show our results. The probability
of learning the true model grows linearly
with sample size, with 672 of 1000 four node models
and 765 of 1000 five node models correctly
recovered from 5000 sample datasets. By
"correctly recovered" we mean that
our search procedure returned one of the graphs
shown in Fig. 7 for datasets obtained from the DAG
model in Fig. 4(a), and one of the graphs in the
appropriate equivalence class for datasets obtained
from the DAG model in Fig. 4(b). Although there
is currently no complete theory of observational
equivalence of nested Markov models, as there is
for Markov factorizing models, we do provide evidence
for the characterization of all equivalence
classes of 4 node models in the next section.

8 Equivalence Conjecture for Mixed 
Graphs of Four Nodes 

When describing the search and score algorithm 
for nested Markov models, we mentioned that no 
complete theory of equivalence of models with re- 
spect to post-truncation independence currently 
exists. Characterization of equivalence does exist 
for DAGs [E], and MAGs [1]. In this section, we
present an investigation of the issue of equivalence 
with respect to post-truncation independence for 
the special case of mixed graphs of four nodes. 

It is known that there are exactly 543 four node
DAGs (sequence A003024 in OEIS). Since each of the
6 vertex pairs may or may not carry a bidirected edge,
this implies that there are $543 \times 2^6 = 34752$ ADMGs with
four nodes. These ADMGs are arranged into
185 equivalence classes representing DAG models
of conditional independence, and 63 equivalence
classes representing models of independence
which can be represented by a mixed graph, but
not a DAG, for a total of 248 equivalence classes.
If we consider nested Markov models, which in
addition to conditional independences imply
post-truncation independences, we expect the number
of equivalence classes to expand, since two
mixed graphs may agree on all conditional independences,
but disagree on post-truncation independences.
For instance, this is the case for the
ADMG shown in Fig. 5(a) and a complete DAG.
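The two counts quoted above are easy to reproduce. The Python sketch below uses the standard inclusion-exclusion recurrence for the number of labeled DAGs (the recurrence listed for OEIS A003024) and then multiplies by the number of ways to choose bidirected edges on the vertex pairs. The function names are chosen for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled vertices, via the recurrence
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k))
               * count_dags(n - k) for k in range(1, n + 1))


def count_admgs(n):
    """Each DAG combined with any subset of the C(n, 2) possible
    bidirected edges."""
    return count_dags(n) * 2 ** comb(n, 2)


four_node_dags = count_dags(4)
four_node_admgs = count_admgs(4)
```

For four nodes this gives 543 DAGs and 543 × 64 = 34752 ADMGs, in agreement with the figures in the text.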

We conjecture that for a given ADMG G, if 
the model of conditional independence [9] and
a nested Markov model agree on the parameter 
count, then they define the same model [10]. This 
conjecture would imply that it is sufficient to char- 
acterize equivalence in ADMGs where the model 
in [9] and the nested Markov model give differ- 
ent parameter counts. There are exactly 228 such 
ADMGs. 

We conjecture that these 228 ADMGs are ar- 
ranged in 84 equivalence classes. These 84 classes 
fall in a small number of graph patterns with mul- 
tiple classes having the same pattern but different 
vertex labeling. Specifically, of the 84 classes, 24
are of type (a), shown in Fig. 5(a), 12 are of
type (b), shown in Fig. 5(b), 24 contain graphs
with the patterns shown in Fig. 6, and 24 contain
graphs with the patterns shown in Fig. 7.

The model contained in one of 24 singleton equivalence
classes shown in Fig. 5(a) has a single
post-truncation independence which states (up to
node relabeling) that $X_4 \perp\!\!\!\perp X_1$ in the distribution
obtained from $P(x_1, x_2, x_3, x_4)$ after truncating
out $P(x_3 \mid x_2, x_1)$.

The model contained in one of 12 singleton equivalence
classes shown in Fig. 5(b) has a single
post-truncation independence which states (up to
node relabeling) that $X_4 \perp\!\!\!\perp \ldots$ in the distribution
obtained from $P(x_1, x_2, x_3, x_4)$ after truncating
out $P(x_2 \mid x_1)$. The advantage of exploiting
post-truncation independence is clear in
the cases shown in Fig. 5. If the best scoring
model lies in these classes, then we can recover
the model structure exactly just from observing
a single post-truncation independence, whereas
if we restricted ourselves to conditional independence
we would be unable to distinguish models
in these classes from saturated models.

The model contained in one of 24 equivalence
classes shown in Fig. 6 (which contains 5
ADMGs) has a single post-truncation independence
which states (up to node relabeling) that
$X_4 \perp\!\!\!\perp X_1$ in the distribution obtained from
$P(x_1, x_2, x_3, x_4)$ after truncating out
$P(x_3 \mid x_2, x_1)$. In the case of models shown in Fig. 6,
even though the equivalence class contains 5
graphs, these graphs agree on many interesting
(from a causal point of view) structural features.
In particular, in all elements of a particular class
the following edges are present (up to node relabeling):
$X_3 \to X_4$, $X_2 \to X_3$, $X_2 \leftrightarrow X_4$. As before,
these models are indistinguishable from the
saturated model with respect to standard conditional
independence constraints.

The model contained in one of 24 equivalence
classes shown in Fig. 7 (which contains 3 ADMGs)
has one conditional independence which
states (up to node relabeling) that $X_3 \perp\!\!\!\perp X_1 \mid X_2$,
and one post-truncation independence which
states (up to node relabeling) that $X_4 \perp\!\!\!\perp X_1$
in the distribution obtained from $P(x_1, x_2, x_3, x_4)$
after truncating out $P(x_3 \mid x_2)$. Similarly, models
shown in Fig. 7 are members of equivalence
classes containing 3 graphs, yet these graphs agree
on many interesting structural features. As before,
in all elements of a particular class the following
edges are present (up to node relabeling):
$X_3 \to X_4$, $X_2 \to X_3$, $X_2 \leftrightarrow X_4$. These models
are indistinguishable from the (DAG) model asserting
a single conditional independence
$X_3 \perp\!\!\!\perp X_1 \mid X_2$, with respect to standard conditional
independence constraints.

To confirm our conjecture, we have verified that 
log-likelihood values of nested Markov models ob- 
tained by Q-FIT from datasets generated from a 
four node saturated model are always the same 
within our conjectured classes. 

Finally, we note that if our conjecture is correct, 
post-truncation independences occur in about 
25% (84/(248 + 84)) of four node mixed graph 
equivalence classes. This suggests that, far from 
being "rare and exotic," these constraints may be 
fairly common in latent variable models. This 
is particularly encouraging since post-truncation 
constraints seem to be quite informative for causal 
discovery. 

9 Discussion 

We described a new class of graphical mod- 
els called nested Markov models, which can be 
viewed as the "closure" of DAG marginal models 
which preserves all equality constraints. These 
constraints include standard conditional indepen- 
dence constraints, and less well-understood con- 
straints which manifest after a truncation oper- 
ation, which corresponds to dividing by a condi- 
tional distribution. We have given a nested fac- 
torization of these models which generalizes the 
standard Markov factorization of DAG models, 
and the factorization of bidirected graph models 
[3] . We have given a parameterization for discrete 
models, and used this parameterization to give 
a parameter fitting and structure learning algo- 
rithm for nested Markov models. Together with 
results in [17], our parameter fitting scheme gives
an MLE for any identifiable causal effect in dis- 
crete nested Markov models. 

We have applied our structure learning algorithm
to simulated data. We have shown that our algorithm
can correctly distinguish models based
on post-truncation independences, which no other
currently known discovery algorithm is capable
of doing. Finally, we used our fitting procedure
to justify a conjecture which characterizes model
equivalence with respect to post-truncation independence
in four node mixed graph models.

Figure 4: (a), (b) DAG models used in our simulation
experiments.

Figure 5: (a) An equivalence class pattern containing
24 equivalence classes which in turn contain
1 graph each. (b) An equivalence class pattern
containing 12 equivalence classes which in
turn contain 1 graph each.

The advantage of our approach is twofold. First, 
by representing latent variables implicitly, we are 
able to reason over a potentially infinite set of 
DAG models which can give rise to a particu- 
lar pattern of constraints. Second, our machin- 
ery explicitly incorporates post-truncation inde- 
pendence, which is a kind of equality constraint 
which generalizes conditional independence, and 
which can be used to distinguish models which 
are not distinguishable with respect to standard 
conditional independence. We have shown cases 
where discovering a single post-truncation inde- 
pendence is sufficient to recover the full structure 
of an ADMG without any ambiguity, though the 
corresponding model has no standard conditional 
independence constraints. 

Both our nested factorization and the post-truncation
independences this factorization implies
have an intuitive causal interpretation. The
factorization can be thought of as decomposing
the joint distribution into tractable pieces corresponding
to joint direct effects on bidirected connected
sets, while the post-truncation independences
correspond to (identifiable) dormant independence
constraints [15], which can be viewed as
either an absence of some direct effect, or a
decomposition of a joint direct effect into multiple
smaller joint direct effects.




Figure 6: (i)-(v) Five graph patterns together representing
24 equivalence classes, each class containing
5 graphs, one from each pattern.

Figure 7: (i)-(iii) Three graph patterns together
representing 24 equivalence classes, each class
containing 3 graphs, one from each pattern.